A Bottom-up Merging Algorithm for Chinese Unknown Word Extraction

Authors

Wei-Yun Ma

Keh-Jiann Chen

Abstract

Statistical methods for extracting Chinese unknown words usually suffer from the problem that superfluous character strings with strong statistical associations are extracted as well. To solve this problem, this paper proposes using a set of general morphological rules to broaden coverage, while appending different linguistic and statistical constraints to the rules to increase the precision of the representation. To disambiguate rule applications and reduce the complexity of rule matching, a bottom-up merging algorithm for extraction is proposed. It merges possible morphemes recursively by consulting the general rules above, and dynamically decides which rule should be applied first according to the priorities of the rules. The effects of different priority strategies are compared in our experiment, and the experimental results show that the performance of the proposed method is very promising.
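The merging loop described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Unit` type, the `make_association_rule` helper, the toy association scores, and the threshold are all hypothetical stand-ins, since the abstract does not give the actual rules, constraints, or priority strategies. The sketch only shows the general shape: find every adjacent pair that some rule accepts, apply the highest-priority match first, and repeat until no rule matches.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Unit:
    text: str
    is_morpheme: bool  # hypothetical flag: possible morpheme of an unknown word


# A rule inspects two adjacent units and returns a priority (higher is applied
# first), or None when the rule or its constraint does not hold.
Rule = Callable[[Unit, Unit], Optional[float]]


def make_association_rule(scores: Dict[Tuple[str, str], float],
                          threshold: float) -> Rule:
    """Hypothetical rule: merge two flagged morphemes if their (toy)
    association score passes a statistical threshold."""
    def rule(left: Unit, right: Unit) -> Optional[float]:
        if not (left.is_morpheme and right.is_morpheme):
            return None
        score = scores.get((left.text, right.text), 0.0)
        return score if score >= threshold else None
    return rule


def bottom_up_merge(units: List[Unit], rules: List[Rule]) -> List[Unit]:
    """Repeatedly merge the adjacent pair with the highest rule priority."""
    units = list(units)
    while True:
        best: Optional[Tuple[float, int]] = None
        for i in range(len(units) - 1):
            for rule in rules:
                priority = rule(units[i], units[i + 1])
                # Strict '>' keeps the leftmost pair on ties.
                if priority is not None and (best is None or priority > best[0]):
                    best = (priority, i)
        if best is None:
            return units  # no rule applies anywhere: merging is finished
        i = best[1]
        merged = Unit(units[i].text + units[i + 1].text, True)
        units[i:i + 2] = [merged]


# Toy data with placeholder strings; the scores are invented for illustration.
toy_scores = {("b", "c"): 0.9, ("c", "d"): 0.4, ("bc", "d"): 0.7}
sequence = [Unit("A", False), Unit("b", True), Unit("c", True),
            Unit("d", True), Unit("E", False)]
result = bottom_up_merge(sequence, [make_association_rule(toy_scores, 0.5)])
```

In this toy run the pair ("b", "c") is merged first because it has the highest priority, and the merged unit then combines with "d". Swapping in a different priority function changes the merge order, which is the kind of strategy comparison the abstract refers to.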

Similar resources

Towards a Hybrid Model for Chinese Word Segmentation

This paper describes a hybrid Chinese word segmenter that is being developed as part of a larger Chinese unknown word resolution system. The segmenter consists of two components: a tagging component that uses the transformation-based learning algorithm to tag each character with its position in a word, and a merging component that transforms a tagged character sequence into a word-segmented sen...

Chinese Unknown Word Extraction by Mining Maximized Substrings

The issue of identifying out-of-vocabulary (OOV) words is a major difficulty in Chinese word segmentation. We address this issue by applying a very efficient algorithm for extracting maximized substrings (Shen et al., 2013) from a large-scale raw text, which form a list of unknown word candidates. We then apply techniques such as Short-term Store and Lexicon-based Voting to reduce the noises in...

Cascade Markov random fields for stroke extraction of Chinese characters

Extracting perceptually meaningful strokes plays an essential role in modeling structures of handwritten Chinese characters for accurate character recognition. This paper proposes a cascade Markov random field (MRF) model that combines both bottom-up (BU) and top-down (TD) processes for stroke extraction. In the low-level stroke segmentation proce...

A Fast Algorithm of Address Lines Extraction on Complex Chinese Mail Pieces

A fast and efficient method is presented to extract address lines on both machine-printed and handwritten Chinese mail envelopes. The algorithm is based on a bottom-up approach. First, we select text blocks from connected components (CCs) and immediately group the text blocks into the initial lines. Then, the average text block features are computed to validate the initial text lines and gu...

A Hybrid Model for Chinese Word Segmentation

This paper describes a hybrid model that combines machine learning with linguistic and statistical heuristics for integrating unknown word identification with Chinese word segmentation. The model consists of two major components: a tagging component that annotates each character in a Chinese sentence with a position-of-character (POC) tag that indicates its position in a word, and a merging com...

Publication date: 2003
